About the project


First exercise

In the first exercise we installed the required software and connected to the online environment. I had problems getting RStudio and Git to talk to each other, and for this reason I had to redo the installations several times. After these problems I got to practice in the DataCamp environment, but for some reason the platform renders much smaller than normal, and I have not solved this yet.

R exercises

The R exercises covered quite basic use of R, but at the very end we dove quickly into functions. In my opinion, a few more video lectures on that topic would have been good. In any case, a very interesting concept. I have not used Git or Markdown before, so it is good to get experience with these as well.

My Git


Assignment 2


Second exercise

Importing the data

data <- read.csv("learing_2019.csv")

#Summary
summary(data)
##        X          gender       Age           Attitude         Points     
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Median : 83.50           Median :22.00   Median :32.00   Median :23.00  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :22.72  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       Deep            Stra            Surf      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.667   Median :3.188   Median :2.833  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333
#Access ggplot and GGally
library(ggplot2)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#Creating graphics
p1 <- ggplot(data, aes(x = Attitude, y = Points, col = gender))

#and fitting linear model
p2 <- p1 + geom_point() + geom_smooth(method = "lm")
p2

These graphs show that gender plays a role. The biggest difference between the genders is in the attitude variable, where females have a clearly lower mean. The distributions of the attitude and surf variables also differ between males and females.

#Correlations

p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

## Regression model with multiple explanatory variables
my_model <- lm(Points ~ Attitude + Stra + Surf, data = data)

# Summary of the model
summary(my_model)
## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## Stra         0.85313    0.54159   1.575  0.11716    
## Surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Only attitude seems to be significant in this fitted model. The multiple R-squared of the model is close to the squared correlation between Attitude and Points, since the other predictors add little. As it is around 0.2, approximately 20% of the variation in points is explained by the model.
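As a quick sanity check of this interpretation, one could compare the full model to a simple regression on Attitude alone (a sketch, assuming `data` and `my_model` from above):

```r
# Hedged sketch: the simple model's R-squared should be close to the
# multiple R-squared, since Stra and Surf add little explanatory power
simple_model <- lm(Points ~ Attitude, data = data)
summary(simple_model)$r.squared  # roughly 0.19
summary(my_model)$r.squared      # roughly 0.21
```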

# drawing diagnostic plots using the plot() function. Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

The assumptions of the model are that the error terms are approximately normally distributed with mean 0 and constant variance, and that they are uncorrelated and independent of the explanatory variables. In particular, the size of the error should not depend on the values of the explanatory variables or the response.

From these plots it seems that the model assumptions hold approximately, with the small exception that small and large values of Points show somewhat larger deviations from the estimated mean. A couple of observations also have somewhat large leverage, but overall the assumptions of the model seem to hold quite well. The ends of the range remain questionable, and additional analysis is needed.
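One way to follow up on the questionable tails is a formal normality test of the residuals (a sketch, assuming `my_model` is the fitted model above):

```r
# Shapiro-Wilk test complements the QQ-plot: a small p-value
# would suggest the residuals deviate from normality
shapiro.test(residuals(my_model))
```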


Assignment 3

Exercise 3

Mette Nissen, 15.9.2019. Exercise 3: analysis of student alcohol consumption.


Data Wrangling

alc <- read.csv("alc.csv", header = TRUE, sep = ",")


library(ggplot2)
library(GGally)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.2
## ✔ tidyr   0.8.3     ✔ dplyr   0.8.3
## ✔ readr   1.3.1     ✔ stringr 1.4.0
## ✔ tibble  2.1.3     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(knitr)

Relationship between variables and high use

I have included the variables sex, number of school absences, going out with friends, and final grade in the analysis. The hypothesis is that students performing poorly in school and missing classes are at higher risk of heavy alcohol use.

Summary of variables

Looking at the summary, we can see that the mean age is about 16.6 years. The bar chart shows that high use is more common among men. The next boxplot shows that students with high alcohol use go out more in both genders. The final grade also seems to be lower. Plotting absences against age, the last figure shows that absences seem to be higher among older men with high alcohol consumption.

The same can be seen from the mean values. The group has 198 females and 184 males. Splitting by high use, the female counts are 156 (not high use) and 42 (high use); the corresponding numbers for males are 112 and 72, so a higher proportion of men are high users. High use does not affect the final grade among females, but among male students there is a difference in grades: 12.2 in the non-high-use group and 10.3 in the high-use group. Absences are higher with high use in both genders, and the same holds for going out with friends (see the tables of mean absences and mean goout).

#Summary of the data table

kable(summary(alc), digits = 2)
(Wide `kable` summary of all variables, flattened here. Key figures: 382 students; sex F 198 / M 184; mean age 16.59; mean absences 4.5; mean alc_use 1.89; high_use FALSE 268 / TRUE 114; mean final grade G3 11.46.)
# Graphs

p <- ggpairs(alc, columns = c("sex", "absences","goout", "G3", "high_use"), mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

# initialize a barchart of alcohol use difference between genders
g1 <- ggplot(data = alc, aes(x=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Student alcohol consumption by sex")
g1

# initialize a plot of alcohol use and going out with friends
g2 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = goout)) + facet_wrap("high_use") + ggtitle("Student going out with friends by alcohol consumption and sex")
g2

# Boxplot of alcohol use and school final grade
g3 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = G3)) + facet_wrap("high_use") + ggtitle("Student final grades by alcohol consumption and sex")
g3

# Scatterplot showing linear model between age and abscences separated by alcohol consumption
g4 <- ggplot(data = alc, aes(x = age, y = absences, color=sex, alpha = 0.3)) + geom_point() + facet_wrap("high_use")+ geom_jitter() + geom_smooth(method = "lm") + ggtitle("Student absences by alcohol consumption, sex and age")
g4

alc %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
##   sex   count
##   <fct> <int>
## 1 F       198
## 2 M       184
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      156       11.4
## 2 F     TRUE        42       11.7
## 3 M     FALSE      112       12.2
## 4 M     TRUE        72       10.3
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_absences
##   <fct> <lgl>    <int>         <dbl>
## 1 F     FALSE      156          4.22
## 2 F     TRUE        42          6.79
## 3 M     FALSE      112          2.98
## 4 M     TRUE        72          6.12
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_goout
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      156       2.96
## 2 F     TRUE        42       3.36
## 3 M     FALSE      112       2.71
## 4 M     TRUE        72       3.93

Logistic regression model

# glm() model
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")

# the summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
# the coefficients of the model
coef(m)
## (Intercept)    absences        sexM       goout 
## -4.16317087  0.08418399  0.95871896  0.72980859
# the odds ratio (OR)
OR <- coef(m) %>% exp

# the confidence interval (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
OR
## (Intercept)    absences        sexM       goout 
##  0.01555815  1.08782902  2.60835292  2.07468346
CI
##                   2.5 %     97.5 %
## (Intercept) 0.005885392 0.03804621
## absences    1.042458467 1.13933894
## sexM        1.593132148 4.33151387
## goout       1.650182481 2.64111050
# printing out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences    1.08782902 1.042458467 1.13933894
## sexM        2.60835292 1.593132148 4.33151387
## goout       2.07468346 1.650182481 2.64111050

Analysis of GLM

All three predictors in the fitted model are statistically highly significant. Computing the odds ratios and confidence intervals, we see that absences, male sex, and going out are positively associated with high use, and the confidence intervals stay above 1, indicating a significant association. In an earlier fit that also included G3, its coefficient was negative and not significant, so I left G3 out of the model used for the predictions.
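The earlier fit that included G3 can be reproduced like this (a sketch, assuming `alc` is loaded as above):

```r
# Hedged sketch: refit with G3 to see its negative, non-significant
# coefficient before dropping it from the final model
m_with_g3 <- glm(high_use ~ absences + sex + goout + G3,
                 data = alc, family = "binomial")
summary(m_with_g3)
```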

Prediction with the model

# probability of high_use
probabilities <- predict(m, type = "response")

# adding the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))

# the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
##     goout absences sex high_use probability prediction
## 373     2        0   M    FALSE  0.14869987      FALSE
## 374     3        7   M     TRUE  0.39514446      FALSE
## 375     3        1   F    FALSE  0.13129452      FALSE
## 376     3        6   F    FALSE  0.18714923      FALSE
## 377     2        2   F    FALSE  0.07342805      FALSE
## 378     4        2   F    FALSE  0.25434555      FALSE
## 379     2        2   F    FALSE  0.07342805      FALSE
## 380     1        3   F    FALSE  0.03989428      FALSE
## 381     5        4   M     TRUE  0.68596604       TRUE
## 382     1        2   M     TRUE  0.09060457      FALSE
# target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     65   49

Among the last ten cases, the true high_use values are TRUE for 3 students and FALSE for 7. In the predictions these numbers are TRUE 1 and FALSE 9.

Model predictions in graphics

# a plot of 'high_use' versus 'probability' in 'alc'
pr <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
pr2 <- pr + geom_point() + ggtitle("Predictions")
pr2

Cross validation

Here the prediction error of the model is calculated with a loss function (the training error):

# the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
# a loss function 
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# loss_func computes the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241

The proportion of wrong predictions in this model, 21%, is smaller than the 26% obtained in the DataCamp exercise.
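The training error above can be complemented with 10-fold cross-validation using `boot::cv.glm` (a sketch, assuming `m`, `alc`, and `loss_func` from above):

```r
library(boot)
# K-fold cross-validation; delta[1] is the average prediction error
# estimated on data not used for fitting the model
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```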


Assignment 4

Exercise 4

# access packages

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(GGally)
library(tidyverse)
library(dplyr)
library(knitr)
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# load the data
data("Boston")

Summary of the Boston Data

This data frame contains the following columns:

crim = per capita crime rate by town.
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town.
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox = nitrogen oxides concentration (parts per 10 million).
rm = average number of rooms per dwelling.
age = proportion of owner-occupied units built prior to 1940.
dis = weighted mean of distances to five Boston employment centres.
rad = index of accessibility to radial highways.
tax = full-value property-tax rate per $10,000.
ptratio = pupil-teacher ratio by town.
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat = lower status of the population (percent).
medv = median value of owner-occupied homes in $1000s.

In this week's exercise we use the Boston data set from the R MASS package, which contains historical data collected from 506 districts in the area around Boston.

Boston has 14 variables and 506 observations. The crime rate (crim) is the response variable.

Variables and their explanations are shown above.

#Dataset summary and variables 

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
pairs(Boston)

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
#Graphical summary of crime variable

ggplot(Boston, aes(crim)) + stat_density() + theme_bw()

#Plotting each variable against crime rate

bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
  facet_wrap(~variable, scales="free")+
  geom_point()

Graphical overview and summaries of variables

boxplot(Boston$crim, Boston$zn, Boston$indus, Boston$chas, Boston$nox, Boston$rm, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv, names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))

Fitting a multiple linear regression model with all variables

mlm <- lm(formula = crim ~ ., data = Boston)
summary(mlm)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

Multiple linear regression model interpretation

The most significant variables in the model are dis and rad (high significance), medv (moderate significance), and zn and black (lower significance, but still p < 0.05).

Including Correlations

cor_matrix<-cor(Boston) 
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot.mixed(cor_matrix, number.cex = .6)

The corrplot shows the relationships between the variables. The highest positive correlations are between rad and tax, indus and nox, and age and nox. The highest negative correlations are between dis and nox, age and dis, and lstat and medv. We can also see from the summaries that the scales of the variables differ greatly, so we need to standardize the data set before the linear discriminant analysis later.

Standardizing the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling effects

Standardizing centers the data and puts continuous variables on a unit scale: the mean of the variable is subtracted from each value and the result is divided by the variable's standard deviation. After this, each variable has mean 0 and standard deviation 1.
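The same standardization can be done by hand, which also verifies what `scale()` computes (using the crim variable as an example):

```r
# scale() subtracts the column mean and divides by the column
# standard deviation (with the n - 1 denominator, like sd())
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston$crim)), manual)  # TRUE
```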

Creating categorical variable of the crime rate

# creating a quantile vector of crim 
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#removing crim
boston_scaled <- dplyr::select(boston_scaled, -crim)

#adding categorical variable to the table
boston_scaled <- data.frame(boston_scaled, crime)

Creating training and test sets

To assess predictions, we split the data into a training set, here 80% of the cases, and use the remaining 20% as a test set that shows the accuracy of the model.

n <- nrow(boston_scaled)

#Choosing 80% to the training set
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

# creating the test set 
test <- boston_scaled[-ind,]
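Because `sample()` is random, the split above gives a different training set on every run; seeding the RNG first makes it reproducible (a sketch; the seed value is arbitrary):

```r
set.seed(2019)  # arbitrary fixed seed for a reproducible split
n     <- nrow(boston_scaled)
ind   <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```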

Fitting linear discriminant analysis on the training set using crime rate as the target variable

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2252475 0.2500000 0.2648515 
## 
## Group means:
##                   zn      indus        chas        nox           rm
## low       0.89912056 -0.8570457 -0.08484810 -0.8596159  0.431575407
## med_low  -0.05305028 -0.3461630 -0.05600488 -0.5976699 -0.101719055
## med_high -0.38365582  0.2012935  0.19544522  0.4303616  0.006974305
## high     -0.48724019  1.0170108 -0.08835242  1.0620794 -0.464230693
##                 age        dis        rad        tax     ptratio
## low      -0.8585189  0.7392015 -0.6832698 -0.7467001 -0.45104413
## med_low  -0.3350773  0.4316354 -0.5489875 -0.4877512 -0.08402393
## med_high  0.3863600 -0.3600747 -0.4099119 -0.2896182 -0.26163461
## high      0.8257121 -0.8686921  1.6392096  1.5148289  0.78203563
##                black       lstat        medv
## low       0.38468514 -0.75240467  0.53107331
## med_low   0.31538347 -0.16205421  0.01650618
## med_high  0.08704716  0.03677049  0.10171794
## high     -0.84104816  0.90847803 -0.71620500
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08955590  0.64336783 -1.00506708
## indus   -0.05522375 -0.20231081  0.25460518
## chas    -0.09660283 -0.04070963  0.04332074
## nox      0.36638439 -0.85532817 -1.19613326
## rm      -0.10816053 -0.07753404 -0.12069285
## age      0.24969515 -0.30115351  0.13354400
## dis     -0.09930625 -0.30957524  0.59826457
## rad      3.14491919  1.00776866 -0.18742432
## tax      0.08295319 -0.07652879  0.69360956
## ptratio  0.10923113 -0.03283484 -0.27277266
## black   -0.13462280  0.03461104  0.08613158
## lstat    0.22463139 -0.23505020  0.38159522
## medv     0.17914980 -0.43591027 -0.06716101
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9549 0.0345 0.0106

LDA biplot

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Prediction based on the lda model

# saving the correct classes from test data
correct_classes <- test$crime

# removing the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       3        2    0
##   med_low    4      20       11    0
##   med_high   1       4       19    1
##   high       0       0        0   20

From the cross table we can see that the high class is predicted very accurately, but more errors occur in the lower classes.
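The overall accuracy can be read off the cross table by dividing the diagonal (correct predictions) by the total; a quick check in Python with the counts above:

```python
# rows: correct class, columns: predicted class (low, med_low, med_high, high)
table = [
    [17,  3,  2,  0],  # low
    [ 4, 20, 11,  0],  # med_low
    [ 1,  4, 19,  1],  # med_high
    [ 0,  0,  0, 20],  # high
]
correct = sum(table[i][i] for i in range(4))
total = sum(sum(row) for row in table)
print(correct, total, round(correct / total, 3))  # 76 102 0.745
```

Note that without a fixed seed the train/test split, and hence these counts, change from run to run.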

Clustering: distances with the Euclidean metric

# Boston dataset reading and standardization again

data("Boston")
b_boston_scaled <- scale(Boston)

# Distances with euclidean distance
dist_eu <- dist(b_boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
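The Euclidean distance computed by `dist()` is simply the square root of the summed squared coordinate differences; a minimal Python version with made-up points:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two made-up standardized observations with three features each
p = [0.0, 1.0, -1.0]
q = [1.0, 1.0, 1.0]
print(euclidean(p, q))  # sqrt(1 + 0 + 4) = sqrt(5) ~ 2.236
```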

Clustering with k-means using 3 cluster centers

set.seed(123)
km <- kmeans(b_boston_scaled, centers = 3)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(b_boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is the point where the total within-cluster sum of squares (TWCSS) drops sharply. Here it seems to be two.
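The drop can also be detected programmatically: compare successive decreases in TWCSS and pick the k after which the decrease levels off. A Python sketch with made-up TWCSS values shaped like the curve above (the 0.3 threshold is an arbitrary choice):

```python
# made-up TWCSS values for k = 1..6, with an elbow at k = 2
twcss = [7000, 3500, 3100, 2900, 2750, 2650]
drops = [twcss[i] - twcss[i + 1] for i in range(len(twcss) - 1)]

# the elbow is the k whose following drop is much smaller than the one before it
elbow = next(k for k, (d1, d2) in enumerate(zip(drops, drops[1:]), start=2)
             if d2 < 0.3 * d1)
print(elbow)  # 2
```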

# Clustering again
km2 <- kmeans(b_boston_scaled, centers = 2)
pairs(b_boston_scaled[,1:7], col = km2$cluster)

pairs(b_boston_scaled[,8:14], col = km2$cluster)

Bonus

set.seed(123)
km3 <- kmeans(Boston, centers = 3)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km4 <- kmeans(Boston, centers = 2)
pairs(Boston[,1:7], col = km4$cluster)

pairs(Boston[,8:14], col = km4$cluster)

bins <- quantile(Boston$crim)

crime2 <- cut(Boston$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

table(crime2)
## crime2
##      low  med_low med_high     high 
##      127      126      126      127
Boston <- dplyr::select(Boston, -crim)

Boston <- data.frame(Boston, crime2)

u <- nrow(Boston)

ind2 <- sample(u,  size = u * 0.8)

train2 <- Boston[ind2,]

test2 <- Boston[-ind2,]

lda.fit2 <- lda(crime2 ~ ., data = train2)

lda.fit2
## Call:
## lda(crime2 ~ ., data = train2)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2500000 0.2351485 0.2623762 
## 
## Group means:
##                 zn     indus       chas       nox       rm      age
## low      33.348039  5.173333 0.03921569 0.4534382 6.584784 43.77549
## med_low   8.094059  9.092871 0.04950495 0.4921772 6.197436 59.68119
## med_high  2.547368 12.335158 0.15789474 0.6040526 6.402337 81.93684
## high      0.000000 18.100000 0.05660377 0.6798208 6.005179 91.18208
##               dis       rad      tax  ptratio    black     lstat     medv
## low      5.657954  3.490196 282.8333 17.54412 390.9058  7.175784 27.00098
## med_low  4.467285  4.742574 321.8812 18.31683 385.1081 11.669109 22.51485
## med_high 2.911674  5.736842 349.4211 17.51789 364.4000 12.678211 24.52947
## high     2.027506 24.000000 666.0000 20.20000 278.7887 18.729623 15.95849
## 
## Coefficients of linear discriminants:
##                   LD1           LD2           LD3
## zn       0.0059594595  0.0291392963  -0.043654152
## indus    0.0116837992 -0.0194328871   0.028512310
## chas    -0.4398527204 -0.4347835204  -0.196120764
## nox      1.9916327959 -5.6218782002 -10.346653879
## rm      -0.1892416869 -0.1153374711  -0.261021540
## age      0.0080682351 -0.0142044048  -0.007142634
## dis     -0.0553948278 -0.0682297733   0.054422317
## rad      0.4673782227  0.1037951245  -0.004063668
## tax      0.0004496804 -0.0003602709   0.002638328
## ptratio  0.0678345162  0.0524262621  -0.082236100
## black   -0.0015035340  0.0003945387   0.001896470
## lstat    0.0294406326 -0.0448026121   0.058101833
## medv     0.0234646334 -0.0462138320  -0.009722294
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9645 0.0275 0.0081
lda.arrows2 <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(train2$crime2)

# plot the lda results
# plot the lda results
plot(lda.fit2, dimen = 2, col = classes2, pch = classes2)
lda.arrows2(lda.fit2, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

Nox seems to be the most influential linear separator in the analysis without standardization.

Super Bonus

Getting the colors right took some debugging: `test$classes` and `km$clusters` do not exist (the scores are computed from the training set, and the k-means component is called `cluster`), so the plots below color the points by the training-set crime classes and by k-means clusters fitted on the same training observations. Otherwise nice.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# color by the crime classes of the training set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)

# color by k-means clusters fitted on the same training observations
km_train <- kmeans(as.matrix(model_predictors), centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = factor(km_train$cluster))
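The `%*%` step above is plain matrix multiplication: each LDA score is a weighted sum of the 13 predictors, with the columns of `lda.fit$scaling` as weights. A minimal Python version on small made-up matrices:

```python
def matmul(A, B):
    """(n x m) @ (m x k) -> (n x k); the same operation as R's %*%."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2, like the 404 x 13 predictors in miniature
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3, like the 13 x 3 scaling matrix
print(matmul(A, B))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```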

Chapter 5. Dimensionality reduction techniques

This week we are using the human dataset from the United Nations Development Programme. Countries are used as row names, and we have selected eight variables:

- Edu2.FM - ratio of females to males with at least secondary education
- Labo.FM - ratio of female to male labour-force participation rates
- Edu.Exp - expected years of schooling
- Life.Exp - life expectancy at birth
- GNI - gross national income per capita
- Mat.Mor - maternal mortality ratio
- Ado.Birth - adolescent birth rate
- Parli.F - percentage of female representatives in parliament

Data exploration

human <- read.csv("data/human.csv", row.names = 1)

library(GGally)
library(tidyverse)
library(corrplot)
library(dplyr)
library(knitr)
library(tidyr)
library(reshape2)
library(plotly)
library(FactoMineR)
library(ggplot2)


str(human)
## 'data.frame':    148 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1980   Min.   :0.1857   Min.   : 7.00   Min.   :49.00  
##  1st Qu.:0.8097   1st Qu.:0.5971   1st Qu.:11.57   1st Qu.:68.38  
##  Median :0.9444   Median :0.7504   Median :13.55   Median :74.35  
##  Mean   :0.8766   Mean   :0.7006   Mean   :13.42   Mean   :72.44  
##  3rd Qu.:0.9990   3rd Qu.:0.8385   3rd Qu.:15.32   3rd Qu.:77.45  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor        Ado.Birth         Parli.F     
##  Min.   :   680   Min.   :  1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  5190   1st Qu.: 11.0   1st Qu.: 12.18   1st Qu.:12.15  
##  Median : 12408   Median : 45.5   Median : 31.30   Median :19.50  
##  Mean   : 18402   Mean   :120.9   Mean   : 43.72   Mean   :20.95  
##  3rd Qu.: 25350   3rd Qu.:170.0   3rd Qu.: 69.05   3rd Qu.:27.82  
##  Max.   :123124   Max.   :730.0   Max.   :175.60   Max.   :57.50
pairs <- ggpairs(human)
pairs

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

From the summary we can see that the distributions differ greatly between variables: GNI ranges from 680 to 123124, whereas Labo.FM ranges only from 0.1857 to 1.0380, so the variables are on very different scales. Correlations between variables range from strongly positive (maternal mortality vs. adolescent birth rate) to strongly negative (maternal mortality vs. life expectancy).
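The values in the corrplot are plain Pearson coefficients; a small Python sketch with made-up country values illustrating the strong negative relation between maternal mortality and life expectancy:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# made-up values: life expectancy falls as maternal mortality rises
mat_mor = [4, 6, 11, 120, 400, 730]
life_exp = [82, 81, 80, 72, 60, 50]
print(round(pearson(mat_mor, life_exp), 2))  # strongly negative, close to -1
```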

Principal component analysis (PCA) on the non-standardized human data

#PCA analysis
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.862483e+04 1.420218e+02 2.116337e+01 1.147634e+01 3.608478e+00
## [6] 1.571358e+00 1.923557e-01 1.512244e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4
## Edu2.FM   -4.649714e-06  0.0006867686 -0.0013662633  2.317587e-04
## Labo.FM   -9.719607e-08 -0.0003158778  0.0003015743  4.425582e-03
## Edu.Exp   -8.706739e-05  0.0087580112  0.0053006118  3.255666e-02
## Life.Exp  -2.533543e-04  0.0333850374 -0.0048336882  7.494569e-02
## GNI       -9.999893e-01 -0.0046050756 -0.0003728181 -6.050839e-05
## Mat.Mor    4.477844e-03 -0.9863064242  0.1609608588  1.124047e-02
## Ado.Birth  1.111169e-03 -0.1611663247 -0.9864335638 -3.008031e-02
## Parli.F   -5.580863e-05  0.0034657989 -0.0314144646  9.961287e-01
##                     PC5           PC6           PC7           PC8
## Edu2.FM   -0.0055155087  1.762118e-02  6.384062e-01  7.694765e-01
## Labo.FM    0.0052146757  3.217052e-02  7.688482e-01 -6.385848e-01
## Edu.Exp    0.1415081241  9.886674e-01 -3.580796e-02  8.073940e-03
## Life.Exp   0.9861585472 -1.437809e-01  4.420684e-03  6.632638e-03
## GNI       -0.0001023982 -2.864828e-05 -1.086468e-06 -1.752803e-06
## Mat.Mor    0.0337070459  2.675068e-03  1.432260e-04  1.224251e-03
## Ado.Birth  0.0039566221  7.122449e-03 -7.522696e-04 -1.109192e-03
## Parli.F   -0.0791032655 -2.145731e-02 -2.750969e-03  1.847856e-03
#Summary of principal component analysis
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)

# creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink"), xlab = pc_lab[1], ylab = pc_lab[2], main = (title = "PCA_non-scaled"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

On unscaled data PCA is not useful because of the large impact of GNI, which has by far the largest standard deviation; as a result the first component captures essentially 100% of the variance. Therefore we must scale the data to get a valid analysis.
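This dominance can be checked from the printed standard deviations: squaring them gives the variance captured by each component, and PC1's share is essentially all of it.

```python
# standard deviations of the components from the unscaled PCA above
sds = [1.862483e+04, 1.420218e+02, 2.116337e+01, 1.147634e+01,
       3.608478e+00, 1.571358e+00, 1.923557e-01, 1.512244e-01]
variances = [s ** 2 for s in sds]
share_pc1 = variances[0] / sum(variances)
print(round(100 * share_pc1, 3))  # 99.994
```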

PCA on standardized data

# standardize the variables
human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp            Life.Exp      
##  Min.   :-3.1089   Min.   :-2.6159   Min.   :-2.42684   Min.   :-3.0723  
##  1st Qu.:-0.3065   1st Qu.:-0.5256   1st Qu.:-0.69805   1st Qu.:-0.5329  
##  Median : 0.3108   Median : 0.2531   Median : 0.04826   Median : 0.2503  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.5610   3rd Qu.: 0.7005   3rd Qu.: 0.71899   3rd Qu.: 0.6566  
##  Max.   : 2.8414   Max.   : 1.7144   Max.   : 2.56114   Max.   : 1.4495  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9515   Min.   :-0.7355   Min.   :-1.1573   Min.   :-1.8197  
##  1st Qu.:-0.7094   1st Qu.:-0.6742   1st Qu.:-0.8466   1st Qu.:-0.7643  
##  Median :-0.3218   Median :-0.4626   Median :-0.3333   Median :-0.1258  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3730   3rd Qu.: 0.3009   3rd Qu.: 0.6799   3rd Qu.: 0.5973  
##  Max.   : 5.6228   Max.   : 3.7352   Max.   : 3.5397   Max.   : 3.1749
#PCA analysis with scaled variables
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0333042 1.1429035 0.8966375 0.7974445 0.7016064 0.5533825 0.4782085
## [8] 0.3039769
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.32465414  0.09227181 -0.43377536  0.71962839 -0.3646631
## Labo.FM    0.02176326  0.72972518 -0.51145278 -0.19826256  0.3349423
## Edu.Exp   -0.42571462  0.14257522 -0.03704407 -0.14368290  0.1144547
## Life.Exp  -0.44696256 -0.02108739  0.13444430 -0.07252202  0.1544074
## GNI       -0.36191020  0.03284391 -0.10502738 -0.55841503 -0.6927543
## Mat.Mor    0.44847613  0.16196165 -0.07472686 -0.20980104 -0.2594996
## Ado.Birth  0.41583795  0.14603035 -0.06162895  0.13254030 -0.3781145
## Parli.F   -0.08992529  0.62416308  0.71441898  0.20859522 -0.1663542
##                    PC6         PC7         PC8
## Edu2.FM   -0.102011694  0.06004801  0.18184654
## Labo.FM   -0.113322727 -0.17601955 -0.10061935
## Edu.Exp    0.611088764  0.62425612  0.01404528
## Life.Exp   0.296297409 -0.63847349  0.50711229
## GNI       -0.176468208 -0.08563262 -0.16340648
## Mat.Mor    0.004512949  0.24190007  0.77276191
## Ado.Birth  0.683004537 -0.31602881 -0.27395047
## Parli.F   -0.133691243  0.04841963 -0.02315017
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)

# creating an object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")

# drawing a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "blue"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = (title = "PCA_scaled"))

With scaling the analysis gives a much more reliable result. Rwanda seems to be an outlier in this 2-dimensional biplot. PC1 and PC2 together account for 68% of the variance, which is quite a lot.
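The 68% figure can be verified from the printed standard deviations: with standardized variables the squared SDs sum to 8 (one unit of variance per variable), and the first two components' share of that total is:

```python
# standard deviations of the components from the standardized PCA above
sds = [2.0333042, 1.1429035, 0.8966375, 0.7974445,
       0.7016064, 0.5533825, 0.4782085, 0.3039769]
variances = [s ** 2 for s in sds]
share_12 = (variances[0] + variances[1]) / sum(variances)
print(round(sum(variances), 2), round(100 * share_12, 1))  # 8.0 68.0
```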

The arrows show similar effects within the GII gender-inequality categories (labour-force participation and parliament representation) and within the welfare categories (education, income, life expectancy and maternal health). Judging from the angles between the arrows, the variables within each group correlate quite strongly with each other.

And now for something completely different - Tea

data(tea)

# What is tea all about

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

The tea dataset includes 300 observations on 36 variables describing tea-drinking habits. Most of the variables are factors, typically with two levels. The full dataset is too large for a reasonable analysis here, so I have picked the variables related to health attributes.

# column names to keep in the dataset
keep_columns <- c("Tea", "sugar", "pub", "friends", "healthy", "effect.on.health", "slimming", "relaxing")

# selecting the 'keep_columns' to create a new dataset
tea_health <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the tea_health data
summary(tea_health)
##         Tea           sugar          pub             friends   
##  black    : 74   No.sugar:155   Not.pub:237   friends    :196  
##  Earl Grey:193   sugar   :145   pub    : 63   Not.friends:104  
##  green    : 33                                                 
##         healthy               effect.on.health        slimming  
##  healthy    :210   effect on health   : 66     No.slimming:255  
##  Not.healthy: 90   No.effect on health:234     slimming   : 45  
##                                                                 
##         relaxing  
##  No.relaxing:113  
##  relaxing   :187  
## 
str(tea_health)
## 'data.frame':    300 obs. of  8 variables:
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
gather(tea_health) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

pairs <- ggpairs(tea_health)
pairs

Most tea drinkers drink Earl Grey (193 of 300), and most drink tea with friends (196). 210 respondents think tea is healthy, but only 66 think it has an effect on health. The majority (187) find tea relaxing. Sugar is used more often with Earl Grey than with the other teas.

# multiple correspondence analysis
mca <- MCA(tea_health, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_health, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.177   0.175   0.153   0.129   0.123   0.112
## % of var.             15.749  15.527  13.567  11.437  10.899   9.959
## Cumulative % of var.  15.749  31.275  44.842  56.279  67.178  77.137
##                        Dim.7   Dim.8   Dim.9
## Variance               0.099   0.087   0.071
## % of var.              8.804   7.731   6.329
## Cumulative % of var.  85.940  93.671 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           | -0.309  0.179  0.087 |  0.395  0.298  0.142 | -0.111  0.027
## 2           | -0.530  0.528  0.259 |  0.622  0.739  0.357 |  0.121  0.032
## 3           | -0.202  0.076  0.086 | -0.251  0.120  0.133 | -0.036  0.003
## 4           | -0.129  0.031  0.025 | -0.184  0.065  0.052 | -0.346  0.261
## 5           |  0.133  0.033  0.020 |  0.269  0.138  0.082 | -0.136  0.040
## 6           | -0.350  0.230  0.191 |  0.043  0.003  0.003 | -0.114  0.028
## 7           | -0.202  0.076  0.086 | -0.251  0.120  0.133 | -0.036  0.003
## 8           | -0.610  0.700  0.390 |  0.323  0.200  0.110 |  0.221  0.106
## 9           |  0.133  0.033  0.020 |  0.269  0.138  0.082 | -0.136  0.040
## 10          | -0.610  0.700  0.390 |  0.323  0.200  0.110 |  0.221  0.106
##               cos2  
## 1            0.011 |
## 2            0.014 |
## 3            0.003 |
## 4            0.181 |
## 5            0.021 |
## 6            0.020 |
## 7            0.003 |
## 8            0.051 |
## 9            0.021 |
## 10           0.051 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black       |  -0.512   4.559   0.086  -5.064 |   0.560   5.528   0.103
## Earl Grey   |   0.363   5.991   0.238   8.438 |  -0.379   6.599   0.259
## green       |  -0.977   7.410   0.118  -5.940 |   0.959   7.243   0.114
## No.sugar    |  -0.360   4.719   0.138  -6.432 |   0.367   4.981   0.144
## sugar       |   0.385   5.044   0.138   6.432 |  -0.392   5.325   0.144
## Not.pub     |  -0.041   0.092   0.006  -1.366 |   0.267   4.033   0.268
## pub         |   0.153   0.348   0.006   1.366 |  -1.005  15.170   0.268
## friends     |   0.173   1.383   0.057   4.112 |  -0.340   5.416   0.218
## Not.friends |  -0.326   2.607   0.057  -4.112 |   0.641  10.207   0.218
## healthy     |  -0.488  11.773   0.556 -12.896 |  -0.227   2.584   0.120
##              v.test     Dim.3     ctr    cos2  v.test  
## black         5.537 |   0.936  17.710   0.287   9.264 |
## Earl Grey    -8.792 |  -0.108   0.613   0.021  -2.506 |
## green         5.831 |  -1.469  19.430   0.267  -8.928 |
## No.sugar      6.562 |   0.351   5.200   0.131   6.267 |
## sugar        -6.562 |  -0.375   5.559   0.131  -6.267 |
## Not.pub       8.957 |  -0.092   0.546   0.032  -3.080 |
## pub          -8.957 |   0.345   2.053   0.032   3.080 |
## friends      -8.079 |   0.085   0.382   0.013   2.006 |
## Not.friends   8.079 |  -0.159   0.720   0.013  -2.006 |
## healthy      -5.999 |   0.021   0.025   0.001   0.549 |
## 
## Categorical variables (eta2)
##                    Dim.1 Dim.2 Dim.3  
## Tea              | 0.255 0.271 0.461 |
## sugar            | 0.138 0.144 0.131 |
## pub              | 0.006 0.268 0.032 |
## friends          | 0.057 0.218 0.013 |
## healthy          | 0.556 0.120 0.001 |
## effect.on.health | 0.345 0.131 0.181 |
## slimming         | 0.043 0.010 0.379 |
## relaxing         | 0.017 0.235 0.023 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

With this analysis we can see that 31.275% of the model's variance is explained by the first two dimensions. The social aspects seem to load mostly on Dim 2 and physical health on Dim 1. Judging from the minor clusters, people who drink tea with friends tend to use sugar and find tea relaxing, whereas people who drink tea without friends tend to drink black tea with no sugar.
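The eigenvalue table can be sanity-checked: with 8 active variables carrying 17 categories in total, the total inertia of an MCA is (17 - 8) / 8 = 1.125, and the first two dimensions' share of it reproduces the ~31% figure (small differences come from the rounded eigenvalues).

```python
# eigenvalues of the 9 MCA dimensions as reported above (rounded to 3 decimals)
eig = [0.177, 0.175, 0.153, 0.129, 0.123, 0.112, 0.099, 0.087, 0.071]
total = sum(eig)  # total inertia; theoretically (17 categories - 8 variables) / 8
share_12 = (eig[0] + eig[1]) / total
print(round(total, 3), round(100 * share_12, 1))  # 1.126 31.3
```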